Have you ever opened a syntax or script file two years after running an analysis only to find that you have no immediate memory of the code?
Have you received analysis files from a collaborator, or downloaded them from an online repository that you have never used before?
Now imagine that these files are very hard to read, or there are lots of variables being passed to arcane functions, or worse, you can’t find the useful code because the files are saved with meaningless names such as:
analysis_1final_FINAL.R
onlyusethisoneforanalysis_onamonday2a.py.
If you have not, then you are one of the lucky ones! But if you have, then you know how frustrating it is to work with those files.
This chapter highlights ways to avoid such challenges in your projects by introducing some principles of ‘code hygiene’, otherwise known as linting.
Some integrated development environments (IDEs) include automatic linting, and there are also free packages and tools that will lint code for you.
By keeping the following advice in mind while coding, your code will be more reusable, adaptable, and clear.
The Style Guide chapter in Data Management in Large-Scale Education Research provides examples for file naming, variable naming, and general code styling.
However, to get started quickly, the following sections present some advice for code style.
File Naming
The Center for Open Science has some useful suggestions for naming files so that they are readable by both humans and machines. These include avoiding special characters (@£$%), using underscores (“_”) to delimit pieces of information, and using dashes (“-”) in place of spaces to join words within one piece of information.
They also suggest dating or numbering files and avoiding words like FINAL (or FINAL-FINAL).
The suggested date format is the long form YYYY-MM-DD, followed by the name of the file and the version number.
Because the date comes first, files sort chronologically by default. For example:
data <- read.csv("2019-05-17_Turing-Way_Book-Dash.csv")
For more details, please see the chapter on File Naming.
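The naming pattern above can be sketched in a few lines of Python. This is only an illustration: the helper name `dated_filename` and the choice to replace unsafe characters with dashes are assumptions, not part of the Center for Open Science guidance.

```python
import re
from datetime import date


def dated_filename(day, name, extension):
    """Build a file name such as 2019-05-17_Turing-Way_Book-Dash.csv.

    Hypothetical helper: underscores separate pieces of information,
    dashes join words within one piece.
    """
    # Replace every run of characters other than letters, digits,
    # underscores, and dashes (spaces, brackets, symbols) with one dash.
    safe_name = re.sub(r"[^A-Za-z0-9_-]+", "-", name.strip()).strip("-")
    # date.isoformat() gives the long YYYY-MM-DD form.
    return f"{day.isoformat()}_{safe_name}.{extension}"


print(dated_filename(date(2019, 5, 17), "Turing-Way_Book-Dash", "csv"))
```

Because the date leads the name in YYYY-MM-DD form, an ordinary alphabetical sort of such names is also a chronological sort.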
Versioning
A further consideration when naming files is versioning your software.
Using versioning guidelines will help you avoid suffixes like _FINAL.R.
A typical convention is the MajorMinorPatch (or MajorMinorRevision) approach.
Under this convention, your first attempt at a package or library might be named: my-package_1_0_0.py
This indicates that the software is at its first major release (1) and has not yet been revised (0) or patched (0).
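A minimal sketch of how such a version could be read from, and incremented in, a file name is shown below. It assumes the underscore-separated `_Major_Minor_Patch` suffix used in the example above; the function names `parse_version` and `bump_patch` are hypothetical.

```python
import re

# Matches "_1_0_0" when it sits directly before the file extension.
VERSION_PATTERN = re.compile(r"_(\d+)_(\d+)_(\d+)(?=\.[A-Za-z0-9]+$)")


def parse_version(filename):
    """Return (major, minor, patch) as integers."""
    match = VERSION_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no version found in {filename!r}")
    return tuple(int(part) for part in match.groups())


def bump_patch(filename):
    """Return the file name with the patch number increased by one."""
    major, minor, patch = parse_version(filename)
    return VERSION_PATTERN.sub(f"_{major}_{minor}_{patch + 1}", filename)


print(parse_version("my-package_1_0_0.py"))
print(bump_patch("my-package_1_0_0.py"))
```

Comparing the parsed tuples rather than the raw strings keeps versions in numeric order, so 1_10_0 correctly ranks after 1_2_0.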
Variable naming conventions
CamelCase
lowerCamelCase
Underscore_Methods
Mixed_Case_With_Underscores
lowercase
It is important to choose one style and stick to it:
ThisIs Because_SwitchingbetweenDifferentformats is.difficult to read.
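If a project has drifted between styles, names can be normalised mechanically. The sketch below converts the styles listed above to lowercase with underscores; the helper `to_snake_case` is hypothetical, and it assumes names without runs of capitals (acronyms would need extra handling).

```python
import re


def to_snake_case(name):
    """Convert CamelCase, lowerCamelCase, or dotted names to snake_case."""
    # Insert an underscore wherever a lowercase letter or digit
    # is immediately followed by a capital letter.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Treat dots (common in older R code) as separators too.
    return with_breaks.replace(".", "_").lower()


for original in ["CamelCase", "lowerCamelCase", "Mixed_Case_With_Underscores", "is.difficult"]:
    print(original, "->", to_snake_case(original))
```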
Writing clear, well-commented, readable, and reusable code benefits not only you but also the community (or audience) that you are developing it for.
This may be your lab, external collaborators, stakeholders, or you might be writing open source software for global distribution!
Whatever scale you work at, readability counts!
Line Length
There is some agreement on how long a line of code should be.
PEP8 suggests a maximum of 79 characters per line, and the R style guide suggests 80.
This means that the lines can easily fit on a screen, and multiple coding windows can be opened.
It is argued that if your line is any longer than this then your function is too complex and should be separated!
This is the crux of the Tidy method of R programming, which even has a special operator %>% which passes the previous object to the next function, so fewer characters are required:
recoded_melt_dat <- read_csv('~/files/2019-05-17_dat.csv') %>%
  recode() %>%
  melt()
# We now have a recoded, melted dataframe called recoded_melt_dat
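Checking line length by hand is tedious, so it is usually automated. The following is a rough sketch of such a check, using the PEP8 limit of 79 characters as the default; the function name `long_lines` is hypothetical, and real linters do considerably more.

```python
def long_lines(source, max_length=79):
    """Return (line_number, length) for every line longer than max_length."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


example = "x = 1\ny = " + "1" * 80
for number, length in long_lines(example):
    print(f"line {number} is {length} characters long")
```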
Commenting
Generally, comment the “why”, not the “what”.
The PEP8 guidelines firmly suggest that block comments should be full sentences, have two spaces following a period, and follow a dated style guide (Strunk and White).
Inline comments should be used sparingly.
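To illustrate the difference between “why” and “what”, consider the hypothetical function below. The scenario and the 100 ms threshold are invented purely for illustration. A “what” comment such as `# filter the list` would only restate the code, whereas the comment here records a decision that the code alone cannot explain.

```python
def mean_reaction_time(times_ms):
    """Return the mean of the valid reaction times, in milliseconds."""
    # Responses faster than 100 ms are treated as anticipations rather
    # than genuine reactions, so they are excluded before averaging.
    # (Hypothetical threshold, chosen only for this example.)
    valid = [t for t in times_ms if t >= 100]
    return sum(valid) / len(valid)


print(mean_reaction_time([50, 200, 300]))
```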
Indentation
The R style guide suggests that code should be indented by two spaces, and not by a mixture of tabs and spaces.
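Mixed tabs and spaces are hard to spot by eye because they often look identical on screen. A small sketch of a check is given below; the function name `mixed_indentation` is hypothetical, and it only flags lines that mix both characters within their own leading whitespace, not files that switch style between lines.

```python
def mixed_indentation(source):
    """Return the numbers of lines whose indentation mixes tabs and spaces."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip would remove.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in indent and "\t" in indent:
            flagged.append(number)
    return flagged


example = "if x:\n  y = 1\n \tz = 2\n\tw = 3"
print(mixed_indentation(example))
```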
These are of course just guidelines, and you should choose elements that suit your coding style.
However, and again, it is important to ensure that you are consistent when collaborating, and can agree on a common style.
It could be useful to create a readme file describing your coding style so collaborators or contributors can follow your lead.
As mentioned earlier, there are some automatic tools that you can use to lint your code to existing guidelines.
These range from plugins for IDEs, to packages that ‘spell-check’ your style, to scripts that automatically lint for you.
lintr
lintr is an R package that spell-checks your code using a variety of style guidelines. It can be installed from CRAN.
The function lint takes a filename as an argument and a list of ‘linters’ that it should check your code against.
These range from whitespace conventions to checking that curly brackets do not have their own lines.
The output provides a list of markers with recommendations for changing the formatting of your code line-by-line, meaning it is best used early and often in your project.
An example of what the lintr output may look like for an input file with R code.
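The general design described above, a `lint` function that applies a list of linters and returns line-by-line markers, can be sketched in Python as follows. This is not how lintr is implemented: every name here is hypothetical, and the sketch takes source text rather than a filename so that it is self-contained.

```python
def trailing_whitespace(line):
    """Linter: flag spaces or tabs at the end of a line."""
    if line != line.rstrip():
        return "trailing whitespace"
    return None


def tab_character(line):
    """Linter: flag any tab character."""
    if "\t" in line:
        return "tab character found; use spaces"
    return None


def lint(source, linters):
    """Apply each linter to each line and collect (line, linter, message)."""
    markers = []
    for number, line in enumerate(source.splitlines(), start=1):
        for linter in linters:
            message = linter(line)
            if message is not None:
                markers.append((number, linter.__name__, message))
    return markers


example = "x = 1 \n\ty = 2\nz = 3"
for marker in lint(example, [trailing_whitespace, tab_character]):
    print(marker)
```

Keeping each check as a separate function means a project can agree on its own list of linters, much as the list passed to `lint` is configurable in lintr.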